
Conversation

@DannyYuyang-quic
Contributor

@DannyYuyang-quic DannyYuyang-quic commented Sep 9, 2025

Summary:

  • e2e script for GA Static Gemma3-1B
    • perf: 16a4w block quant, KV-mode token rate ~110 tokens/sec (SM8750), max_seq_len=1024
    • acc: PPL on the wikitext dataset: fp 21.375 -> htp 23.086
    • add model params config
    • add End-to-End example in README
  • add new architecture:
    • add a new class to support the global/local RoPE static llama architecture required by Gemma3
    • enable global/local static llama architecture support in runner
  • refactoring:
    • refactor the attention mask to improve integration with the global/local RoPE static llama model (see the mask sketch after this list)
    • refactor kv_inference and prefill_inference for better readability
  • Unit tests:
    • add unit test for Gemma3-1B
    • improve readability of memory size constant in unit test
  • LLM model config visualization
    • support tabular visualization of the LLM model config
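
For readers unfamiliar with the global/local split, below is a minimal sketch (not the PR's actual AttentionMask implementation) of the two mask shapes a global/local model like Gemma3 needs: a standard causal mask for global-attention layers and a sliding-window mask for local-attention layers. The window size and the NEG_INF fill value are placeholder assumptions.

```python
import torch

NEG_INF = -255.0  # placeholder fill value; the real runner may use a different constant


def causal_mask(seq_len: int) -> torch.Tensor:
    """Global-attention layers: position i may attend to every position j <= i."""
    mask = torch.full((seq_len, seq_len), NEG_INF)
    return torch.triu(mask, diagonal=1)  # 0 on/below the diagonal, NEG_INF above


def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Local-attention layers: position i may attend only to j in (i - window, i]."""
    idx = torch.arange(seq_len)
    dist = idx.unsqueeze(1) - idx.unsqueeze(0)  # dist[i, j] = i - j
    allowed = (dist >= 0) & (dist < window)
    return torch.where(allowed, torch.tensor(0.0), torch.tensor(NEG_INF))


if __name__ == "__main__":
    print(causal_mask(6))
    print(sliding_window_mask(6, window=3))
```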

Test plan

python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --temperature 0 --model_mode hybrid --max_seq_len 1024 --prefill_ar_len 128 --decoder_model gemma3-1b --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1

@pytorch-bot

pytorch-bot bot commented Sep 9, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/14108

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Cancelled Job, 5 Unrelated Failures

As of commit 9c80e2f with merge base 8496f27:

NEW FAILURES - The following jobs have failed:

CANCELLED JOB - The following job was cancelled. Please retry:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed label Sep 9, 2025
@DannyYuyang-quic
Contributor Author

@pytorchbot label "release notes: qualcomm"

@pytorch-bot pytorch-bot bot added the release notes: qualcomm label Sep 9, 2025
@DannyYuyang-quic
Contributor Author

Hi @cccclai,
This PR enables support for Gemma3-1B in the static version.

Both accuracy and performance in Hybrid/KV mode are promising.
However, since Gemma3 uses a global/local attention mechanism, the implementation of lookahead decoding is a bit tricky, so lookahead decoding is temporarily blocked for Gemma3 only. We plan to enable lookahead decoding support for Gemma3 in a future update.

cc: @haowhsu-quic

@facebook-github-bot
Contributor

@cccclai has imported this pull request. If you are a Meta employee, you can view this in D82034374.

Contributor

@jackzhxng jackzhxng left a comment


Are the changes in examples/models/gemma3 and examples/models/llama relevant?

@cccclai
Contributor

cccclai commented Sep 9, 2025

Are the changes in examples/models/gemma3 and examples/models/llama relevant?

Yes, because we reuse the config from etLLM for the Qualcomm LLM models as well.

@cccclai
Contributor

cccclai commented Sep 9, 2025

There are some merge conflicts, can you resolve them?

@DannyYuyang-quic DannyYuyang-quic force-pushed the dev1/danny/GA_static_gemma3 branch from 2333ffc to 9c80e2f on September 10, 2025 01:26
@facebook-github-bot
Contributor

@cccclai has imported this pull request. If you are a Meta employee, you can view this in D82034374.

Contributor

@jackzhxng jackzhxng left a comment


Looks good. Trunk errors look like flakes

@cccclai cccclai merged commit e2e33c4 into pytorch:main Sep 10, 2025
456 of 472 checks passed
@mergennachin
Contributor

@DannyYuyang-quic

There's a bug in this code that was uncovered in our internal testing:

executorch/examples/qualcomm/oss_scripts/llama/model/static_llama.py", line 266, in forward_sha
    attn = attn + atten_mask
           ~~~~~^~~~~~~~~~~~
TypeError: unsupported operand type(s) for +: 'Tensor' and 'AttentionMask'

@mergennachin
Contributor

atten_mask is sometimes a Tensor and sometimes an AttentionMask object.

cc @cccclai

@mergennachin
Contributor

Can you fix this ASAP? Otherwise, I'll revert this PR.

@haowhsu-quic
Collaborator

May I know the details of the internal test scenario? I tested the latest mainline without masked_softmax and everything works fine.

@haowhsu-quic
Collaborator

The atten_mask is now a wrapper class for causal / sliding attention. We unwrap it during the lowering process in llama.py.
Is it possible you are running inference with the inputs from LlamaModel.get_example_inputs()? If so, please unpack the attention mask with an asterisk:

tokens, atten_mask, pos_ids, k_caches, v_caches = model.get_example_inputs()
logits, new_k_caches, new_v_caches = module(
  tokens,
  *atten_mask,
  pos_ids,
  *k_caches,
  *v_caches,
)
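
For context, here is a hypothetical sketch of why unpacking with * works; the actual AttentionMask class in static_llama.py may differ, and the field names below are made up for illustration:

```python
# Hypothetical sketch only -- not the actual AttentionMask class from static_llama.py.
# It shows why `*atten_mask` passes plain Tensors into the module instead of the
# wrapper object (the wrapper is what triggered the TypeError above).
from dataclasses import dataclass
import torch


@dataclass
class AttentionMaskWrapper:
    causal: torch.Tensor          # mask used by global-attention layers
    sliding_window: torch.Tensor  # mask used by local-attention layers

    def __iter__(self):
        # Unpacking with * iterates the wrapper and yields the raw Tensors,
        # so downstream code that computes `attn + atten_mask` sees a Tensor.
        yield self.causal
        yield self.sliding_window
```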

@cccclai
Contributor

cccclai commented Sep 11, 2025

I'll try to fix it, looks like just an internal reference needs to be updated.

StrycekSimon pushed a commit to nxp-upstream/executorch that referenced this pull request Sep 23, 2025